MCSdb, a database of proteins residing in membrane contact sites

Organelles do not act as autonomous discrete units but rather as interconnected hubs that engage in extensive communication by forming close contacts called “membrane contact sites (MCSs)”. Many proteins have been identified as residing in MCSs and playing important roles in maintaining these microdomains and fulfilling their specific functions. However, a comprehensive compilation of these MCS proteins is still lacking. Therefore, we developed MCSdb, a manually curated resource of MCS proteins and complexes from publications. MCSdb documents 7010 MCS protein entries and 263 complexes, involving 24 organelles and 44 MCSs across 11 species. Additionally, MCSdb organizes all data into categories and provides extensive annotation for each MCS protein. In summary, MCSdb provides a valuable resource for accelerating the functional interpretation of MCSs and the deciphering of interorganelle communication.


Background & Summary
"All things are mutually woven together and therefore have an affinity for each other" (Marcus Aurelius, Meditations). Most biologists would agree with Aurelius' statement because connectivity is observed at every level of biology, occurring between biomolecules, cells, tissues and organisms 1-5. Accordingly, it is becoming increasingly evident that organelles do not act as autonomous discrete units but rather as interconnected hubs that engage in extensive communication by forming close contacts called "membrane contact sites (MCSs)" 6,7. An MCS is defined as an area of close apposition (from 10 to 80 nm) between the membranes of two organelles that are physically connected via proteinaceous tethers but do not fuse (Fig. 1) 8,9. Studies of MCSs are moving to center stage in cell biology 9,10. MCSs have been identified between virtually all organelles in eukaryotic cells and participate in various biological processes and intracellular signaling, such as autophagy, lipid metabolism, calcium homeostasis, and organelle trafficking and remodeling 10-13. Moreover, aberrant loss or gain of function of MCSs can contribute to various diseases, such as cancer, metabolic diseases and neurodegenerative disorders 14-18. Studies on spatiotemporal coordination among organelles thus point to a hidden world of interorganelle communication networks, connected by MCSs, that is waiting to be explored 19,20.
As key components of MCSs, the proteins residing in membrane contacts play a crucial role in maintaining MCSs and fulfilling their specific functions 9,21. Understanding how these protein tethers and membrane contacts coordinate organelle function will redefine our view of the cell 14,22. Recently, a growing number of MCS proteins have been identified and functionally characterized. For example, three Aster proteins (Aster-A, -B, -C) can be recruited to plasma membrane (PM)-endoplasmic reticulum (ER) contact sites and facilitate nonvesicular cholesterol transport from the PM to the ER 23. The Sel1L-Hrd1 protein complex is involved in ER-mitochondria (MT) crosstalk and can affect mitochondrial dynamics in brown adipocytes 24. The protein complex consisting of SLDP1, SLDP2 and LIPA mediates lipid droplet (LD)-PM tethering in plant cells 25. Loewen et al. identified a short conserved determinant called the FFAT motif. This motif interacts with the VAP protein family, whose members are conserved integral membrane proteins located in the ER. These proteins play a pivotal role in the formation and function of various ER-related MCSs 26-28. Subsequently, a series of proteins that contain the FFAT motif were recognized as MCS proteins 29,30. Several studies have begun to screen MCS proteins by combining traditional biochemical approaches (subcellular fractionation and pull-down) with mass spectrometry (MS)-based proteomics 31-33. However, the limitations of such traditional biochemical approaches (e.g., destabilized contacts and contamination by other components) may lead to the detection of a large number of false-positive proteins 9,34. More recently, proximity labeling approaches combined with high-throughput proteomic analysis, such as BioID, Contact-ID and Split-TurboID, have been developed for global mapping of MCS proteins and are promising for MCS proteomics studies 35-39.
Although MCSs have received increasing attention and many proteins residing in MCSs have been identified in the past few years 40-42, a dedicated database for storing, integrating and reorganizing MCS proteins is still lacking. Therefore, we developed MCSdb, a manually curated database of experimentally supported MCS proteins and complexes from publications. The current version of MCSdb documents 7010 manually curated MCS protein entries and 263 complexes with experimental evidence, involving 24 organelles and 44 MCSs across 11 species. Furthermore, MCSdb grades all MCS protein entries into 3 categories according to the confidence level of the experimental evidence. MCSdb also provides extensive annotation to help users query and analyze MCS proteins and complexes. To our knowledge, MCSdb is the first database specifically focusing on proteins located in MCSs. We believe that this database will be invaluable in accelerating functional research on MCSs and the deciphering of interorganelle communication. The dataset of MCS proteins and complexes is freely available in Figshare 43.

Methods
Data collection.
The MCS proteins in the database were curated manually from the literature (published before June 2023). First, we retrieved literature from PubMed, bioRxiv, Web of Science and Google Scholar using the following keywords: 'membrane contact site', 'organelle communication', 'organelle interaction', 'mitochondria-associated membranes', 'protein tether' and 'proximity labeling'. We also searched binary phrases, each naming two organelles, such as 'endoplasmic reticulum-plasma membrane', 'endoplasmic reticulum-Golgi', 'endoplasmic reticulum-peroxisome' and 'endoplasmic reticulum-lipid droplet' (Fig. 2). Then, all retrieved publications were preliminarily reviewed by expert curators to filter out false-positive papers. Following several review articles 9,10, an MCS is defined as an area of close apposition (from 10 to 80 nm) between two bilayer or monolayer membrane-bound organelles that are physically connected via proteinaceous tethers but do not fuse. To be included as an MCS protein in MCSdb, a protein must have experimental confirmation that it is located at the MCS, or evidence showing that it can be recruited to the MCS and contributes to its formation or to the functions associated with it. Additionally, the protein complexes located and acting in MCSs are recorded in MCSdb.

Data organization.
First, we distinguished MCSs by the organelles they connect (named ER-PM, ER-MT, MT-LD, etc.), and a total of 44 MCSs were defined. Then, we divided all MCS protein entries into different categories according to the MCS in which the proteins are located (Fig. 2). Meanwhile, we graded all documented MCS protein entries into 3 categories: low-throughput (LT) experimental-based methods, proximity labeling (PL)-based methods and mass spectrometric (MS)-based methods. The LT category contains proteins identified and functionally characterized by low-throughput experimental methods, and there are two additional inclusion criteria for LT-based proteins: (1) the protein must not have been identified solely through high-throughput experiments; (2) the literature source reporting a given protein must identify fewer than 10 MCS proteins (a criterion inspired by those of the protein-protein interaction (PPI) databases MINT 44, mentha 45 and InWeb_InBioMap 46). The PL category contains proteins identified by combining PL approaches with high-throughput proteomics. The MS category contains proteins identified by combining traditional biochemical approaches with MS techniques. To help users evaluate the reliability of MS-based data, we introduced a scoring system based on protein subcellular localization and PPI networks. We obtained interaction information for MS-based proteins from the STRING database and subcellular localization data from the UniProt database 47 to determine whether MS-based proteins and their interacting partners are located within the organelles forming the MCS. This system stratifies MS-based data into five confidence levels (L1 to L5), with detailed rules outlined on the "Help" page.
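The three-way grading described above can be sketched in a few lines of Python. This is a minimal illustration, not the curators' actual procedure: the `Evidence` record, its field names, and the precedence of PL over MS when both apply are assumptions made for the example. Only the two LT criteria (not solely high-throughput; fewer than 10 MCS proteins reported by the source paper) come from the text.

```python
from dataclasses import dataclass

# Threshold from LT criterion (2): the source paper must report
# fewer than 10 MCS proteins.
LT_MAX_PROTEINS_PER_PAPER = 10


@dataclass
class Evidence:
    """One literature record for a protein (field names are hypothetical)."""
    low_throughput: bool      # characterized by low-throughput experiments
    proximity_labeling: bool  # found by a PL screen (e.g. BioID, Split-TurboID)
    fractionation_ms: bool    # found by fractionation/pull-down plus MS
    proteins_in_paper: int    # number of MCS proteins reported by the paper


def grade(e: Evidence) -> str:
    """Assign an entry to the LT, PL or MS category."""
    # LT criterion (1): not solely high-throughput, i.e. some
    # low-throughput evidence exists; criterion (2): small paper.
    if e.low_throughput and e.proteins_in_paper < LT_MAX_PROTEINS_PER_PAPER:
        return "LT"
    # Assumption: PL takes precedence over MS when both flags are set.
    if e.proximity_labeling:
        return "PL"
    if e.fractionation_ms:
        return "MS"
    raise ValueError("entry has no supported evidence type")


print(grade(Evidence(True, False, False, 3)))
print(grade(Evidence(False, True, False, 250)))
print(grade(Evidence(False, False, True, 900)))
```

The L1 to L5 scoring of MS-based entries is deliberately left out, because its rules are given only on the "Help" page and not in this text.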

Data annotation.
To unify the proteins from multiple publications against authoritative reference databases, all MCS proteins were mapped to the NCBI Gene database (Entrez ID) 48 and UniProt (UniProt ID) 47. Five compounds involved in MCS complexes were mapped to the PubChem database (PubChem CID) 49. Information on subcellular localization, cell line/tissue and descriptions of MCS proteins was manually curated from the literature (Fig. 2). Human and mouse gene expression data across different tissues were collected from the Human Protein Atlas (HPA) (62 human tissues) 50 and the TISSUES 2.0 database (39 mouse tissues) 51, respectively. Protein sequence data were collected from the UniProt database. Orthology information for MCS proteins was collected from five databases: EggNOG 52, HOGENOM 53, OrthoDB 54, TreeFam 55 and GeneTree 56. The PPIs involved in MCS complexes were extracted from the BioGRID database 57.

Data Records
Recorded datasets.
MCSdb is freely available at Figshare 43. It provides four types of datasets. The first dataset contains detailed information on all MCS proteins (xlsx file), including the Entrez ID, protein name, synonyms, UniProt ID, species, MCS location and the references (experimental method, cell line/tissue, PMID, and description and evidence). The second dataset contains detailed information on all complexes (xlsx file), including complex name, subunit number, species, MCS location and information about all subunits (protein names and UniProt IDs). Detailed information on the compound subunits is also provided (names, PubChem CID, formula and SMILES). The third dataset contains the list of 44 MCS locations along with their corresponding organelles (xlsx file). The last dataset contains detailed information on all publications, including PMID, DOI, journal name, authors, title, abstract and publication date. The MCS proteins documented in the database were identified by various experimental methods and thus have different confidence levels. For example, some collected MCS proteins are high-confidence because they were well identified and functionally characterized by multiple low-throughput experimental methods, whereas other proteins were only screened by high-throughput methods and require further experimental validation 9,34. Therefore, after careful consideration of common perspectives from multiple review articles and the characteristics of the data 9,10,35-38,58,59, we graded all documented MCS protein entries into 3 categories according to the confidence level of the experimental evidence.

Data statistics.
The current version of MCSdb documents 7010 manually curated MCS protein entries with experimental evidence (including 5985 entries detected by MS-based methods, 616 entries detected by LT-based methods and 409 entries detected by PL-based methods), covering 24 organelles and 44 MCSs across 11 species; 263 complexes residing in MCSs are also included (Fig. 3a). The distributions of protein entries across MCS categories are shown in Fig. 3b-d. The protein entries detected by LT-based methods are distributed across multiple MCSs (Fig. 3b), most of which are ER-related (ER-MT: 213, ER-PM: 88, ER-Endosome: 52, etc.). The protein entries detected by PL-based methods fall into three MCSs: ER-MT (277 proteins), ER-PM (66 proteins) and ER-Peroxisome (66 proteins). Over 95% of the protein entries detected by MS-based methods are located in the ER-MT MCS (5729 proteins). All complexes were detected by LT-based methods and are distributed across multiple MCSs (data not shown). The organismal distribution of MCS proteins and complexes is shown in Fig. 3e-g. The protein entries detected by LT-based methods are distributed across 11 species and are mostly human (281 proteins) and mouse (150 proteins) proteins (Fig. 3e). A total of 2253 human, 3476 mouse and 256 yeast protein entries were detected by MS-based methods (Fig. 3f). All the protein entries detected by PL-based methods are human proteins (data not shown). The complexes are distributed across 11 species (Fig. 3g). The subunit number distribution of the complexes is shown in Fig. 3h; most of the complexes are binary (153/263).
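As a quick consistency check, the counts quoted in this paragraph can be summed and compared in Python. The snippet uses only numbers stated in the text; the variable names are illustrative.

```python
# Counts quoted in the Data statistics paragraph.
entries_by_method = {"MS": 5985, "LT": 616, "PL": 409}
pl_by_mcs = {"ER-MT": 277, "ER-PM": 66, "ER-Peroxisome": 66}
ms_by_species = {"human": 2253, "mouse": 3476, "yeast": 256}
ms_er_mt = 5729
binary_complexes, total_complexes = 153, 263

total_entries = sum(entries_by_method.values())
ms_er_mt_share = ms_er_mt / entries_by_method["MS"]
binary_share = binary_complexes / total_complexes

# The per-method counts add up to the reported total of 7010 entries.
assert total_entries == 7010
# The PL entries split exactly across the three listed MCSs.
assert sum(pl_by_mcs.values()) == entries_by_method["PL"]
# The MS entries split exactly across human, mouse and yeast.
assert sum(ms_by_species.values()) == entries_by_method["MS"]
# "Over 95%" of MS entries are in ER-MT.
assert ms_er_mt_share > 0.95

print(f"total entries: {total_entries}")
print(f"MS entries in ER-MT: {ms_er_mt_share:.1%}")
print(f"binary complexes: {binary_share:.1%}")
```

All four assertions pass, so the subtotals are consistent with the headline figures; binary complexes account for a little over half of the 263 complexes.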

Data submission.
Because the MCSdb collection may not include all proteins residing in MCSs, we offer a 'Submit' interface (https://cellknowledge.com.cn/mcsdb/submit.html) through which researchers can submit new MCS proteins that have not yet been documented in the database. We will review all submitted data thoroughly and update the database in a timely manner.

Technical Validation
The MCS protein entries in our database were carefully curated from peer-reviewed literature through manual selection, and we only included experimentally supported MCS proteins. In addition, all collected entries were evaluated and double-checked independently by at least two expert curators. Any discrepancies were resolved by consensus through discussion with a third expert curator.
To ensure the accuracy of our data, we collected detailed information from the original articles on the wet-lab experiments used to identify MCS proteins, such as experimental methods and cell lines/tissues. Additionally, we extracted original sentences from the literature that explicitly describe a protein's role and residence in the MCS, providing further evidence for the accuracy of our data. All of these supporting data from the literature can be easily accessed on the website of our database: https://cellknowledge.com.cn/mcsdb/.

Usage Notes
In addition to being accessible via the Figshare repository 43, MCSdb is freely available at https://cellknowledge.com.cn/mcsdb/. The website provides a user-friendly 'Help' page with a step-by-step tutorial to assist users in manipulating, querying and browsing the MCSdb database. On this 'Help' page, we offer guidance on maintaining data quality and provide specific instances of errors as examples for users to reference (see the file "all revised entries.xlsx", available on the 'download' page).
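Once the protein dataset has been downloaded and read into memory, entries can be filtered by MCS location and species. The sketch below assumes the table has been loaded as a list of dictionaries; the column names follow the fields listed under Data Records, but their exact spelling in the xlsx file is an assumption, and the three rows are invented placeholders rather than MCSdb records.

```python
# Placeholder rows: values are invented for illustration and are not
# taken from MCSdb. Column names are assumed from the Data Records text.
rows = [
    {"Protein name": "PROTEIN_A", "Species": "Homo sapiens", "MCS location": "ER-PM"},
    {"Protein name": "PROTEIN_B", "Species": "Mus musculus", "MCS location": "ER-MT"},
    {"Protein name": "PROTEIN_C", "Species": "Homo sapiens", "MCS location": "ER-MT"},
]


def filter_entries(entries, mcs=None, species=None):
    """Return the entries matching the given MCS location and/or species.

    A filter left as None is not applied.
    """
    selected = []
    for entry in entries:
        if mcs is not None and entry["MCS location"] != mcs:
            continue
        if species is not None and entry["Species"] != species:
            continue
        selected.append(entry)
    return selected


human_er_mt = filter_entries(rows, mcs="ER-MT", species="Homo sapiens")
print([entry["Protein name"] for entry in human_er_mt])
```

With real data, `rows` would be replaced by the contents of the downloaded file, read with any spreadsheet library of the user's choice.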

Introduction to revised data.
During the data collection process, we continuously identified and corrected errors. To help users avoid similar issues during literature searches and data collection, we present the errors found during the revision process, together with their details, as data tables on our database website (all revised entries.xlsx). In addition, we have preserved all modified and obsolete entries in the database for user reference. Specifically, for modified entries, we offer two versions on the website, Version 1 and Version 2, with hyperlinks provided at the top of the detail pages for each version (Fig. 4). Version 1 is the original version, in which we have highlighted the specific modifications for comparison and reference, while Version 2 presents the latest modified data. This approach allows users to easily see the changes made to the data. Moreover, all obsolete entries are displayed separately on a dedicated "obsolete list" page, which includes three tables: the obsolete entries list, listing all obsolete protein entries; the obsolete complex entries list, listing all deleted complex entries; and the obsolete literatures list, listing all removed references.

Fig. 3 Data statistics of MCSdb. (a) Overview of MCS protein entries and complexes. (b) Category distributions of protein entries detected by LT-based methods. (c) Category distributions of protein entries detected by PL-based methods. (d) Category distributions of protein entries detected by MS-based methods. (e) Organismal distribution of protein entries detected by LT-based methods. (f) Organismal distribution of protein entries detected by MS-based methods. (g) Organismal distribution of complexes. (h) Subunit number distribution of complexes.

Fig. 4 The differences between the two version pages of a modified entry.